SSIS
SSIS provides a robust means to move data between sources and targets. Data
can be exported, validated, cleaned up, consolidated, transformed, and
then imported into a destination of any kind. With any OLAP/SSAS
implementation, you will undoubtedly have to transform, clean, or
preprocess data in some way. You can now tap into SSIS capabilities
from within the SSAS platform.
You can combine
multiple column values into a single calculated destination column or
divide column values from a single source column into multiple
destination columns. You might also need to translate values from operational
systems into more readable forms. For example, many OLTP systems use product codes stored as
numeric data. Few people are willing to memorize an entire collection
of product codes. An entry of 100235 for a type of shampoo in a product
dimension table is useless to a vice president of marketing who is
interested in how much of that shampoo was sold in California in the
past quarter.
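The three kinds of column transformation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SSIS itself: the column names, the `PRODUCT_NAMES` lookup table, and the `transform_row` function are all hypothetical.

```python
# Hypothetical lookup table: numeric product codes from an OLTP system
# mapped to readable names for the product dimension.
PRODUCT_NAMES = {100235: "Shampoo", 100236: "Conditioner"}

def transform_row(row):
    """Apply three typical column transformations to one source row."""
    # Divide one source column ("city_state") into two values.
    city, state = [part.strip() for part in row["city_state"].split(",", 1)]
    return {
        # Combine two source columns into one calculated destination column.
        "customer_name": f"{row['last_name']}, {row['first_name']}",
        "city": city,
        "state": state,
        # Keep the code, and translate it into a readable value as well.
        "product_code": row["product_code"],
        "product_name": PRODUCT_NAMES.get(row["product_code"], "Unknown"),
    }

source_row = {
    "first_name": "Ana",
    "last_name": "Ruiz",
    "city_state": "Fresno, CA",
    "product_code": 100235,
}
print(transform_row(source_row))
```

Codes that are missing from the lookup table come through as "Unknown" rather than failing, so they can be found and reconciled later.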
Cleanup and validation of
data are critical to the data’s value in the data warehouse. The old
saying “garbage in, garbage out” applies. If data is missing,
redundant, or inconsistent, high-level aggregations can be inaccurate,
so you should at least know that these conditions exist. Perhaps data
should be rejected for use in the warehouse until the source data can
be reconciled. If the shampoo of interest to the vice president is
called Shamp in one database and Shampoo in another, aggregations on
either value would not produce complete information about the product.
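A minimal Python sketch of this kind of cleanup follows. It assumes a hypothetical `CANONICAL` table of known spelling variants and made-up column names; rows with missing, redundant, or inconsistent values are set aside with a reason instead of being silently aggregated.

```python
from collections import defaultdict

# Hypothetical mapping of known spelling variants to one canonical name.
CANONICAL = {"shamp": "Shampoo", "shampoo": "Shampoo"}

def clean_and_aggregate(rows):
    """Return (totals by canonical product, rejected rows with reasons)."""
    totals = defaultdict(int)
    rejected = []
    seen_orders = set()
    for row in rows:
        name, qty, order_id = row.get("product"), row.get("qty"), row.get("order_id")
        if name is None or qty is None:
            rejected.append((row, "missing value"))
        elif order_id in seen_orders:
            rejected.append((row, "duplicate order"))
        elif name.strip().lower() not in CANONICAL:
            rejected.append((row, "unknown product name"))
        else:
            seen_orders.add(order_id)
            totals[CANONICAL[name.strip().lower()]] += qty
    return dict(totals), rejected

rows = [
    {"order_id": 1, "product": "Shamp", "qty": 10},
    {"order_id": 2, "product": "Shampoo", "qty": 5},
    {"order_id": 2, "product": "Shampoo", "qty": 5},     # redundant
    {"order_id": 3, "product": "Shampoo", "qty": None},  # missing
    {"order_id": 4, "product": "Shmpoo", "qty": 7},      # inconsistent
]
totals, rejected = clean_and_aggregate(rows)
print(totals)
for row, reason in rejected:
    print(reason, row)
```

Here the Shamp and Shampoo rows roll up into a single total, and the rejected list tells you which conditions exist in the source data.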
SSIS packages define
the steps in a transformation workflow. You can execute the steps
serially, in parallel, conditionally, or in any combination of these.
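To make the three execution styles concrete, here is a small Python sketch of a workflow that runs one step serially, branches on a condition, and then runs two independent steps in parallel. It illustrates the control flow only; the step names (`extract`, `is_valid`, `load_sales`, `load_audit`) are hypothetical and have nothing to do with the SSIS object model.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(source):
    """Serial step: pull rows from the (hypothetical) source."""
    return list(source)

def is_valid(rows):
    """Check used by the conditional step: no negative quantities."""
    return all(qty >= 0 for qty in rows)

def load_sales(rows):
    return ("sales", sum(rows))

def load_audit(rows):
    return ("audit", len(rows))

def run_workflow(source):
    rows = extract(source)                 # runs first, serially
    if not is_valid(rows):                 # conditional: stop on bad data
        return []
    with ThreadPoolExecutor() as pool:     # independent loads run in parallel
        futures = [pool.submit(step, rows) for step in (load_sales, load_audit)]
        return [future.result() for future in futures]

print(run_workflow([3, 1, 4]))
print(run_workflow([3, -1, 4]))
```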
OLAP Performance
Performance is a big emphasis of
SSAS. Usage-based aggregation is at the heart of much of what you can
do to help in this area. In addition, the proactive caching mechanism
in SSAS has allowed much of what was previously a bottleneck (and a
slowdown) to be circumvented.
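The idea behind usage-based aggregation can be shown with a deliberately simplified Python sketch: count which dimension combinations users actually query, and precompute only the most popular ones. This is an assumed heuristic for illustration, not the algorithm SSAS uses; `pick_aggregations`, the query log format, and the `budget` parameter are hypothetical.

```python
from collections import Counter

def pick_aggregations(query_log, budget):
    """Return the `budget` most frequently queried dimension combinations."""
    # frozenset makes ("product", "state") and ("state", "product") count together.
    usage = Counter(frozenset(dims) for dims in query_log)
    return [set(combo) for combo, _ in usage.most_common(budget)]

query_log = [
    ("product", "state"),
    ("state", "product"),
    ("product", "quarter"),
    ("product", "state"),
    ("store",),
]
print(pick_aggregations(query_log, 1))
```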
When designing cubes
for deployment, you should consider the data scope of all the data
accesses (that is, all the OLAP queries that will ever touch the cube).
You should only build a cube that is big enough to handle these known
data scopes. If you don’t have requirements for something, you
shouldn’t build it. This helps keep the cube at a smaller, more manageable
size, which translates into faster overall
performance for those who use the cube.
You can also take caching
to the extreme by relocating the OLAP physical storage components to a
solid-state disk device (that is, persistent storage with memory-like
access times). This can give you performance gains of up to tenfold,
depending on the workload. The price of this type of
technology has been dramatically reduced
within the past year or so, and transparently applying this
type of solution to OLAP is a natural fit. It affects both the OLAP
data population process and the day-to-day what-if usage by the end
users. You should keep these types of targeted fixes in mind when
you face OLAP performance issues in this platform. They are easy to
apply, the gains can be large, and you quickly get a return on your
investment.
MPP Data Warehouse Option from Microsoft
A few years ago,
Microsoft acquired DATAllegro, a massively parallel data warehouse
appliance company. This basically lifted the scaling limitations for data
warehousing that SSAS or SQL Server 2008 R2 itself had. Massively parallel
means scaling horizontally on CPU and storage to grow with your data size
and processing needs. There is no practical limit here. The underlying
architecture relies on standards-based technologies. Essentially, there
is a separation of storage and compute nodes that allows you to spread
out your data across vast storage (EMC storage) so that it is very
shallow (easy to get to quickly across all data storage). The compute
power is also horizontally scalable, so the data access for any query is
processed in parallel across the nodes and the results are then
assembled for delivery. Figure 65 shows the high-level architecture of Microsoft’s DATAllegro v3 offering.
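The scatter-and-gather pattern behind this kind of architecture can be sketched in Python. The sketch assumes a simple modulo distribution on a hypothetical `customer_id` key and uses threads to stand in for compute nodes; it is not a description of how the DATAllegro appliance actually distributes data or schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

def distribute(rows, node_count):
    """Spread rows across nodes by customer id (assumed distribution key)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row["customer_id"] % node_count].append(row)
    return nodes

def local_total(node_rows, state):
    """Work done on one node: total sales for a state over its own rows."""
    return sum(r["amount"] for r in node_rows if r["state"] == state)

def parallel_total(nodes, state):
    """Run the same query on every node at once, then assemble the result."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: local_total(node, state), nodes))
    return sum(partials)

rows = [
    {"customer_id": 1, "state": "CA", "amount": 10},
    {"customer_id": 2, "state": "CA", "amount": 20},
    {"customer_id": 3, "state": "NY", "amount": 5},
    {"customer_id": 4, "state": "CA", "amount": 7},
]
nodes = distribute(rows, 3)
print(parallel_total(nodes, "CA"))
```

Each node scans only its own slice of the data, which is why adding nodes shortens the scan. The final step that combines the partial totals is small by comparison.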
Not only is the DATAllegro v3
architecture massively parallel and fast, but the multinode
architecture also makes it highly available. If any node fails, hot
spares kick in to pick up the load. Any failed node can easily be
replaced and brought online with zero processing interruption.
Moreover, multiple appliances can be combined on a common InfiniBand
backbone to create large-scale and extremely powerful multitier or
hub-and-spoke data warehouses with rapid, parallel data movement
between the various appliances. Believe it or not, there is an Ingres
SQL engine at the heart of the database portion of this appliance.
Master Data Services
Completing
the business intelligence picture is a new focus on the data quality
that is needed at all tiers of information delivery. Microsoft has
been pouring an enormous amount of effort (and money) into creating and
embedding master data services throughout its BI and transactional
platforms. By using Microsoft’s Master Data Services, organizations can
align operational and analytical data across the enterprise and across
line-of-business systems with a guaranteed level of data quality for
most core data categories (such as customer data, product data, and
other core data of the business).
Microsoft has created
data stewardship capabilities complete with workflows and notifications
to any business user who might be affected by a core data change.
Managing hierarchies is also an important part of mastering data that
has a natural hierarchical structure, such as customer hierarchies
(parent company to subsidiaries and so on). Each master data change
within the system is treated as a transaction; and the user, date, and
time of each change are logged, as well as pertinent audit details,
such as type of change, member code, and prior versus new value. In
addition to being a very useful audit trail, the transaction log can be
used to selectively reverse changes. Customizable data quality rules
create default values, enable data validation, and trigger actions such
as email notifications and workflows. Rules can be built by IT
professionals or business users directly from the stewardship portal.
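The transaction log behavior described above can be pictured with a short Python sketch. It is a conceptual model only and does not use or represent the Master Data Services API; the `MasterData` class and its methods are hypothetical. Each change records the user, timestamp, member code, and prior and new values, and a reversal is itself logged as a new transaction so that the audit trail stays complete.

```python
from datetime import datetime, timezone

class MasterData:
    """Hypothetical in-memory model of master data with an audit log."""

    def __init__(self):
        self.members = {}   # member code -> attribute dictionary
        self.log = []       # append-only transaction log

    def set_value(self, user, code, attribute, new_value):
        member = self.members.setdefault(code, {})
        entry = {
            "id": len(self.log),
            "user": user,
            "when": datetime.now(timezone.utc),
            "member_code": code,
            "attribute": attribute,
            "prior": member.get(attribute),  # None if not previously set
            "new": new_value,
        }
        member[attribute] = new_value
        self.log.append(entry)
        return entry["id"]

    def reverse(self, user, transaction_id):
        """Selectively undo one change by writing its prior value back."""
        entry = self.log[transaction_id]
        return self.set_value(user, entry["member_code"],
                              entry["attribute"], entry["prior"])

md = MasterData()
first = md.set_value("ann", "C1", "name", "Acme")
second = md.set_value("bob", "C1", "name", "Acme Corp")
md.reverse("ann", second)
print(md.members["C1"]["name"], len(md.log))
```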
Microsoft is still getting
the kinks out of Master Data Services, so you should expect it to
mature considerably over the next few years. Competing products that
have a head start of many years already provide this capability to companies around
the globe, but Microsoft is catching up fast.